-
The Interspeech 2025 speech emotion recognition in naturalistic conditions challenge builds on previous efforts to advance speech emotion recognition (SER) in real-world scenarios. The focus is on recognizing emotions from spontaneous speech, moving beyond controlled datasets. It provides a framework for speaker-independent training, development, and evaluation, with annotations for both categorical and dimensional tasks. The challenge attracted 93 research teams, whose models significantly improved state-of-the-art results over competitive baselines. This paper summarizes the challenge, focusing on the key outcomes. We analyze top-performing methods, emerging trends, and innovative directions. We highlight the effectiveness of combining foundational models based on audio and text to achieve robust SER systems. The competition website, with leaderboards, baseline code, and instructions, is available at: https://lab-msp.com/MSP-Podcast_Competition/IS2025/.
Free, publicly-accessible full text available August 17, 2026
-
There is growing recognition of the important role that the turbulent flow structure of the boundary layer plays in wake recovery and the concomitant wind farm efficiency. Most research thus far has focused on onshore wind farms, in which the ground surface is static. With the expected growth of offshore wind farms, there is increased interest in turbulent flow structures above wavy, moving surfaces and their effects on offshore wind farms. In this study, experiments are performed to analyze the turbulent structure above the waves in the wake of a fixed-bottom model wind farm, with special emphasis on the conditionally averaged Reynolds stresses, using a quadrant analysis. Phase-averaged profiles show a correlation between the Reynolds shear stresses and the curvature of the waves. Using a quadrant analysis, Reynolds stress dependence on the wave phase is observed in the phase-dependent vertical position of the turbulence events. This trend is primarily seen in quadrants 1 and 3 (correlated outward and inward interactions). Quantification of the correlation between the Reynolds shear stress events and the surface waves provides insight into the turbulent flow mechanisms that influence wake recovery throughout the wake region and should be taken into consideration in wind turbine operation and placement.
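The quadrant analysis mentioned in the abstract above can be illustrated with a minimal sketch. It assumes two synchronized velocity time series at a single point (streamwise `u` and vertical `w`); the function name `quadrant_contributions` and the sample data are hypothetical, and the sketch does not include the wave-phase conditioning used in the study.

```python
import numpy as np

def quadrant_contributions(u, w):
    """Split the Reynolds shear stress <u'w'> into its four quadrant parts.

    Q1: u' > 0, w' > 0 (outward interaction)
    Q2: u' < 0, w' > 0 (ejection)
    Q3: u' < 0, w' < 0 (inward interaction)
    Q4: u' > 0, w' < 0 (sweep)
    """
    u = np.asarray(u, dtype=float)
    w = np.asarray(w, dtype=float)
    # Fluctuations about the time mean (Reynolds decomposition).
    up = u - u.mean()
    wp = w - w.mean()
    uw = up * wp
    masks = {
        "Q1": (up > 0) & (wp > 0),
        "Q2": (up < 0) & (wp > 0),
        "Q3": (up < 0) & (wp < 0),
        "Q4": (up > 0) & (wp < 0),
    }
    # Each contribution is normalized by the total sample count, so the
    # four parts add up to the overall mean of u'w'.
    n = uw.size
    return {q: float(uw[m].sum() / n) for q, m in masks.items()}

# Hypothetical four-sample series, one sample per quadrant.
example = quadrant_contributions([1.0, -1.0, 1.0, -1.0], [1.0, 1.0, -1.0, -1.0])
```

To study phase dependence, the same function could be applied separately to the samples falling in each wave-phase bin; how those bins are defined is left open here.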
-
Recent studies have demonstrated the effectiveness of fine-tuning self-supervised speech representation models for speech emotion recognition (SER). However, applying SER in real-world environments remains challenging due to pervasive noise. Relying on low-accuracy predictions due to noisy speech can undermine the user’s trust. This paper proposes a unified self-supervised speech representation framework for enhanced speech emotion recognition designed to increase noise robustness in SER while generating enhanced speech. Our framework integrates speech enhancement (SE) and SER tasks, leveraging shared self-supervised learning (SSL)-derived features to improve emotion classification performance in noisy environments. This strategy encourages the SE module to enhance discriminative information for SER tasks. Additionally, we introduce a cascade unfrozen training strategy, where the SSL model is gradually unfrozen and fine-tuned alongside the SE and SER heads, ensuring training stability and preserving the generalizability of SSL representations. This approach demonstrates improvements in SER performance under unseen noisy conditions without compromising SE quality. When tested at a 0 dB signal-to-noise ratio (SNR) level, our proposed method outperforms the original baseline by 3.7% in F1-Macro and 2.7% in F1-Micro scores, where the differences are statistically significant.
Free, publicly-accessible full text available April 6, 2026
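The two metrics reported above weight classes differently, which a small sketch can make concrete. The function `f1_scores` and the toy labels are illustrative assumptions, not the evaluation code of the paper; the sketch covers the single-label multi-class case only.

```python
def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in labels}
    fp = {c: 0 for c in labels}
    fn = {c: 0 for c in labels}
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro: average the per-class F1 scores, so every class counts equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes before computing F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro

# Hypothetical toy labels with an imbalanced class distribution.
macro_f1, micro_f1 = f1_scores(["a", "a", "a", "b"], ["a", "a", "b", "b"])
```

Macro-F1 is sensitive to performance on rare emotion classes, while micro-F1 is dominated by frequent ones, which is one reason to report both.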
-
An important task in human-computer interaction is to rank speech samples according to their expressive content. A preference learning framework is appropriate for obtaining an emotional rank for a set of speech samples. However, obtaining reliable labels for training a preference learning framework is a challenging task. Most existing databases provide sentence-level absolute attribute scores annotated by multiple raters, which have to be transformed to obtain preference labels. Previous studies have shown that evaluators anchor their absolute assessments on previously annotated samples. Hence, this study proposes a novel formulation for obtaining preference learning labels by only considering annotation trends assigned by a rater to consecutive samples within an evaluation session. The experiments show that the use of the proposed anchor-based ordinal labels leads to significantly better performance than models trained using existing alternative labels.
-
Audio-visual emotion recognition (AVER) has been an important research area in human-computer interaction (HCI). Traditionally, audio-visual emotional datasets and corresponding models derive their ground truths from annotations obtained by raters after watching the audio-visual stimuli. This conventional method, however, neglects the nuanced human perception of emotional states, which varies when annotations are made under different emotional stimuli conditions, whether through unimodal or multimodal stimuli. This study investigates the potential for enhanced AVER system performance by integrating diverse levels of annotation stimuli, reflective of varying perceptual evaluations. We propose a two-stage training method to train models with the labels elicited by audio-only, face-only, and audio-visual stimuli. Our approach utilizes different levels of annotation stimuli according to which modality is present within different layers of the model, effectively modeling annotation at the unimodal and multimodal levels to capture the full scope of emotion perception across unimodal and multimodal contexts. We conduct the experiments and evaluate the models on the CREMA-D emotion database. The proposed methods achieved the best performances in macro-/weighted-F1 scores. Additionally, we measure the model calibration, performance bias, and fairness metrics of the AVER systems with respect to age, gender, and race.
-
The field of speech emotion recognition (SER) aims to create scientifically rigorous systems that can reliably characterize emotional behaviors expressed in speech. A key aspect for building SER systems is to obtain emotional data that is both reliable and reproducible for practitioners. However, academic researchers encounter difficulties in accessing or collecting naturalistic, large-scale, reliable emotional recordings. Also, the best practices for data collection are not necessarily described or shared when presenting emotional corpora. To address this issue, the paper proposes the creation of an affective naturalistic database consortium (AndC) that can encourage multidisciplinary cooperation among researchers and practitioners in the field of affective computing. This paper’s contribution is twofold. First, it proposes the design of the AndC with a customizable-standard framework for intelligently-controlled emotional data collection. The focus is on leveraging naturalistic spontaneous recordings available on audio-sharing websites. Second, it presents as a case study the development of a naturalistic large-scale Taiwanese Mandarin podcast corpus using the customizable-standard intelligently-controlled framework. The AndC will enable research groups to effectively collect data using the provided pipeline and to contribute with alternative algorithms or data collection protocols.
-
When selecting test data for subjective tasks, most studies define ground truth labels using aggregation methods such as the majority or plurality rules. These methods discard data points without consensus, making the test set easier than practical tasks where a prediction is needed for each sample. However, the discarded data points often express ambiguous cues that elicit coexisting traits perceived by annotators. This paper addresses the importance of considering all the annotations and samples in the data, highlighting that reporting a model’s performance only on an incomplete test set selected with the majority or plurality rules can bias the reported performance. We focus on speech emotion recognition (SER) tasks. We observe that traditional aggregation rules have a data loss ratio ranging from 5.63% to 89.17%. From this observation, we propose a flexible method named the all-inclusive aggregation rule to evaluate SER systems on the complete test data. We contrast traditional single-label formulations with a multi-label formulation to consider the coexistence of emotions. We show that an SER model trained with the data selected by the all-inclusive aggregation rule achieves consistently higher macro-F1 scores when tested on the entire test set, including ambiguous samples without agreement.
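How majority and plurality rules discard samples can be sketched as follows. The function names and the toy annotations are hypothetical, each sample is assumed to have at least one vote, and the paper's all-inclusive rule is not reproduced here because its definition is not given in the abstract.

```python
from collections import Counter

def majority_label(votes):
    """Return the label chosen by more than half of the raters, else None."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None

def plurality_label(votes):
    """Return the single most frequent label, or None when the top is tied."""
    ranked = Counter(votes).most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]

def data_loss_ratio(annotations, rule):
    """Fraction of samples for which the rule yields no consensus label."""
    dropped = sum(1 for votes in annotations if rule(votes) is None)
    return dropped / len(annotations)

# Hypothetical annotations: one list of rater votes per speech sample.
annotations = [
    ["happy", "happy", "sad"],
    ["angry", "sad", "neutral"],
    ["sad", "sad", "happy", "angry", "neutral"],
    ["happy", "happy", "sad", "sad"],
]
majority_loss = data_loss_ratio(annotations, majority_label)
plurality_loss = data_loss_ratio(annotations, plurality_label)
```

In this toy set the plurality rule keeps one sample that the majority rule drops, but both rules still discard the tied samples, which are the ambiguous cases the abstract argues should remain in the test set.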
-
Surface performance is critically influenced by topography in virtually all real-world applications. The current standard practice is to describe topography using one of a few industry-standard parameters. The most commonly reported number is $R_a$, the average absolute deviation of the height from the mean line (at some, not necessarily known or specified, lateral length scale). However, other parameters, particularly those that are scale-dependent, influence surface and interfacial properties; for example, the local surface slope is critical for visual appearance, friction, and wear. The present Surface-Topography Challenge was launched to raise awareness of the need for a multi-scale description, but also to assess the reliability of different metrology techniques. In the resulting international collaborative effort, 153 scientists and engineers from 64 research groups and companies across 20 countries characterized statistically equivalent samples from two different surfaces: a “rough” and a “smooth” surface. The results of the 2088 measurements constitute the most comprehensive surface description ever compiled. We find wide disagreement across measurements and techniques when the lateral scale of the measurement is ignored. Consensus is established through scale-dependent parameters while removing data that violates an established resolution criterion and deviates from the majority measurements at each length scale. Our findings suggest best practices for characterizing and specifying topography. The public release of the accumulated data and presented analyses enables global reuse for further scientific investigation and benchmarking.
Free, publicly-accessible full text available September 1, 2026
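The contrast between a scale-independent number such as $R_a$ and a scale-dependent quantity such as the slope can be shown with a minimal sketch. It assumes a one-dimensional height profile sampled at uniform spacing, uses the arithmetic mean as the mean line (no tilt removal), and the function names `roughness_ra` and `rms_slope` are hypothetical rather than taken from the challenge's analysis code.

```python
import numpy as np

def roughness_ra(heights):
    """Average absolute deviation of the height from the mean line."""
    h = np.asarray(heights, dtype=float)
    return float(np.mean(np.abs(h - h.mean())))

def rms_slope(heights, spacing, step=1):
    """Root-mean-square slope from finite differences over `step` samples.

    The lateral scale of the estimate is step * spacing, so the result
    changes with `step` even though the profile itself does not.
    """
    h = np.asarray(heights, dtype=float)
    slopes = (h[step:] - h[:-step]) / (step * spacing)
    return float(np.sqrt(np.mean(slopes ** 2)))

# Hypothetical zigzag profile with unit sample spacing.
profile = [0.0, 2.0, 0.0, 2.0]
ra = roughness_ra(profile)
slope_fine = rms_slope(profile, spacing=1.0, step=1)
slope_coarse = rms_slope(profile, spacing=1.0, step=2)
```

For this profile the single $R_a$ value stays the same while the slope differs sharply between the two lateral scales, which is the kind of disagreement the abstract attributes to ignoring the measurement scale.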
-
The performance of facial expression recognition (FER) systems has improved with recent advances in machine learning. While studies have reported impressive accuracies in detecting emotion from posed expressions in static images, there are still important challenges in developing FER systems for videos, especially in the presence of speech. Speech articulation modulates the orofacial area, changing the facial appearance. These facial movements induced by speech introduce noise, reducing the performance of an FER system. Solving this problem is important if we aim to study more naturalistic environments or applications in the wild. We propose a novel approach to compensate for lexical information that does not require phonetic information during inference. The approach relies on a style extractor model, which creates emotional-to-neutral transformations. The transformed facial representations are spatially contrasted with the original faces, highlighting the emotional information conveyed in the video. The results demonstrate that adding the proposed style extractor model to a dynamic FER system improves the performance by 7% (absolute) compared to a similar model with no style extractor. This novel feature representation also improves the generalization of the model.
-
Geophysical flows occur over a large range of scales, with Reynolds numbers and Richardson numbers varying over several orders of magnitude. For this study, jets of different densities were ejected vertically into a large ambient region, considering conditions relevant to some geophysical phenomena. Using particle image velocimetry, the velocity fields were measured for three different gases exhausting into air – specifically helium, air and argon. Measurements focused on both the jet core and the entrained ambient. Experiments considered relatively low Reynolds numbers from approximately 1500 to 10 000 with Richardson numbers near 0.001 in magnitude. These included a variety of flow responses, notably a nearly laminar jet, turbulent jets and a transitioning jet in between. Several features were studied, including the jet development, the local entrainment ratio, the turbulent Reynolds stresses and the eddy strength. Compared to a fully turbulent jet, the transitioning jet showed up to 50 % higher local entrainment and more significant turbulent fluctuations. For this condition, the eddies were non-axisymmetric and larger than the exit radius. For turbulent jets, the eddies were initially smaller and axisymmetric while growing with the shear layer. At lower turbulent Reynolds number, the turbulent stresses were more than 50 % higher than at higher turbulent Reynolds number. In either case, the low-density jet developed faster than a comparable non-buoyant jet. Quadrant analysis and proper orthogonal decomposition were also utilized for insight into the entrainment of the jet, as well as to assess the energy distribution with respect to the number of eigenmodes. Reynolds shear stresses were dominant in Q1 and Q3 and exhibited negligible contributions from the remaining two quadrants. Both analysis techniques showed that the development of stresses downstream was dependent on the Reynolds number while the spanwise location of the stresses depended on the Richardson number.
